library(here)
library(tidyverse)
library(MetabolAnalyze) # contains pareto scaling
library(reshape2) # data manipulation
library(plotly) # interactive plotsDay 2 - Practical 1 - basic metabolomics data analysis
2024 NUTRIM microbiome & metabolome workshop
1 Intro
Today we’re going to look at the metabolomics data, from the study on IBD patients and controls. Note that those data corresponds to the same samples we used yesterday. It’s just another omics, that requires a different approach to data analysis 😉
The data we will use today come from Liquid Chromatography-Mass Spectrometry (LC-MS) profiling of feces’ metabolites and has already been preprocessed to a table with metabolites’ relative concentrations for each sample.
Raw data and chromatograms
Usually, for the detection of metabolites Nuclear Magnetic Resonance or mass spectrometry (MS), coupled to other technologies (like liquid or gas chromatography) is used, depending on the sample characteristics and study aims.
The raw MS data consists of chromatograms showing peaks of intensity of detected metabolites over time (see Figure 1), mass spectrum (see Figure 2) and retention time.
Common preprocessing steps
In short, data processing perform peaks detection, retention time alignment and integration of peak areas. Additionall steps include removing of zero’s and imputation of missing values. However, today we will only focus on the important concepts of the preprocessed metabolomics data analysis.
PS. The detailed data generation protocols can be found here.
In the first part, we’re going to foucus on the basic concepts in metabolomics data analysis: data pretreatment and dimensionality reduction (PCA).
2 Learning goals 💪
- Understand the effect of data pretreatment (normalization, transformation and scaling) on metabolomics data analysis
- Read and inspect the data in R
- Apply different pretreatment methods in R
- Perform PCA analysis and visualization in R
3 Load R packages and some functions 📦
# We also load an R script, with some useful functions we're going to use today.
source(here("functions.R"))4 Read and inspect data 🔍
Read the metabolites’ relative concentrations table, stored in a a .csv file.
metabolites <- read.csv(file = here("data:papa2012/processed/metabolites_short.csv"), row.names = 1)metabolites[1:10, 1:10] V1 V2 V3 V4 V5 V6 V7 V8 V9
CSM5FZ3T 790031 124175 95316 818174 73935019 331452 662877 2229497 44391
CSM5FZ44 1106701 322143 51995 667889 56592824 622792 281607 4308157 39960
CSM5FZ48 14743075 14974 20352 582356 44550963 1199342 3537295 3202186 743898
CSM5FZ4A 5277616 25217 66435 122080 9161552 733011 230294 1022091 147853
CSM5FZ4K 9419959 15874 13002 87326 6254831 135938 83969 1813658 78266
CSM5FZ4O 341092 30670 29092 216458 14236015 429151 318548 2351268 140408
CSM5MCTZ 11487127 229118 121572 1059328 72279000 117107 290841 674867 21303
CSM5MCUM 48781145 134122 148054 536873 41369143 619150 673328 3993021 199937
CSM5MCUU 4460140 180855 124893 149804 11058836 624780 543346 1660035 274899
CSM5MCVN 585886 262394 138361 721657 60910005 167582 219415 2598553 3369
V10
CSM5FZ3T 96982956
CSM5FZ44 84201363
CSM5FZ48 90547462
CSM5FZ4A 4834431
CSM5FZ4K 8305571
CSM5FZ4O 11587139
CSM5MCTZ 55783394
CSM5MCUM 33364290
CSM5MCUU 9015164
CSM5MCVN 88296989
Read the metabolites metadata file, stored in a .cvs file.
metadata_metabolites <- read.csv(file = here("data:papa2012/processed/metadata_metabolites.csv"), row.names = 1)head(metadata_metabolites) SampleID ParticipantID Age Diagnosis Sex
1 CSM5FZ3T C3002 76 CD Female
2 CSM5FZ44 C3002 76 CD Female
3 CSM5FZ48 C3003 43 UC Female
4 CSM5FZ4A C3004 47 UC Female
5 CSM5FZ4K C3003 43 UC Female
6 CSM5FZ4O C3005 76 UC Female
Metabolites
- Samples are stored in rows, e.g.
CSM5FZ3T,CSM5FZ44 - Metabolites are stored in columns, in the form of relative concentrations
- Note, that the metabolites don’t have identifiers. It’s because we have not identified the specific metabolites yet and those variables represent a detected ion with unique properties, based on which we can later try to identify the metabolites by comparing retention times and mass spectra with known standards or databases.
- Additionally, to reduce the computation time we have chosen only a subset of the original metabolites and samples.
Metadata
- Information on sample IDs, participant IDs, age, diagnosis and sex.
- This makes it possible to compare groups and explore the differences and similarities between them.
Now let’s inspect those data more.
How many metabolites are there?
ncol(metabolites)[1] 423
Are there any missing data?
any(is.na(metabolites))[1] FALSE
How many zero’s do we have per metabolite?
colSums(metabolites == 0) V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
0 0 11 7 1 11 0 0 1 2 1 6 2 0 7 0
V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32
0 2 0 0 0 0 8 10 1 1 3 6 1 1 0 0
V33 V34 V35 V36 V37 V38 V39 V40 V41 V42 V43 V44 V45 V46 V47 V48
2 10 0 1 6 1 8 2 2 1 7 8 9 0 12 0
V49 V50 V51 V52 V53 V54 V55 V56 V57 V58 V59 V60 V61 V62 V63 V64
2 12 3 2 3 12 7 3 0 6 2 11 10 8 3 0
V65 V66 V67 V68 V69 V70 V71 V72 V73 V74 V75 V76 V77 V78 V79 V80
0 9 1 10 5 0 1 0 7 0 11 0 4 0 5 2
V81 V82 V83 V84 V85 V86 V87 V88 V89 V90 V91 V92 V93 V94 V95 V96
9 8 2 0 3 5 0 0 3 0 0 6 0 8 2 6
V97 V98 V99 V100 V101 V102 V103 V104 V105 V106 V107 V108 V109 V110 V111 V112
2 1 1 1 0 7 10 12 2 8 10 8 4 8 2 2
V113 V114 V115 V116 V117 V118 V119 V120 V121 V122 V123 V124 V125 V126 V127 V128
11 4 11 0 8 2 5 4 4 4 6 0 0 1 0 0
V129 V130 V131 V132 V133 V134 V135 V136 V137 V138 V139 V140 V141 V142 V143 V144
1 0 0 4 0 0 0 0 0 6 2 0 1 12 0 5
V145 V146 V147 V148 V149 V150 V151 V152 V153 V154 V155 V156 V157 V158 V159 V160
0 6 0 0 0 0 0 6 1 4 0 10 0 3 0 0
V161 V162 V163 V164 V165 V166 V167 V168 V169 V170 V171 V172 V173 V174 V175 V176
0 0 0 0 6 0 0 11 0 10 1 5 6 0 0 5
V177 V178 V179 V180 V181 V182 V183 V184 V185 V186 V187 V188 V189 V190 V191 V192
4 7 10 8 3 7 6 0 11 0 10 0 4 6 1 2
V193 V194 V195 V196 V197 V198 V199 V200 V201 V202 V203 V204 V205 V206 V207 V208
11 2 1 5 10 3 0 5 2 2 6 7 0 0 5 0
V209 V210 V211 V212 V213 V214 V215 V216 V217 V218 V219 V220 V221 V222 V223 V224
5 7 4 12 0 5 0 6 8 2 5 3 5 0 2 4
V225 V226 V227 V228 V229 V230 V231 V232 V233 V234 V235 V236 V237 V238 V239 V240
0 0 2 1 10 8 3 0 0 6 8 3 5 1 11 1
V241 V242 V243 V244 V245 V246 V247 V248 V249 V250 V251 V252 V253 V254 V255 V256
3 0 2 11 8 2 9 1 4 9 10 0 2 6 5 4
V257 V258 V259 V260 V261 V262 V263 V264 V265 V266 V267 V268 V269 V270 V271 V272
3 4 1 2 11 10 0 0 0 8 1 0 0 3 0 0
V273 V274 V275 V276 V277 V278 V279 V280 V281 V282 V283 V284 V285 V286 V287 V288
0 10 0 4 6 5 8 0 0 12 9 0 11 2 3 7
V289 V290 V291 V292 V293 V294 V295 V296 V297 V298 V299 V300 V301 V302 V303 V304
2 2 6 5 10 8 0 0 2 10 11 12 0 3 10 7
V305 V306 V307 V308 V309 V310 V311 V312 V313 V314 V315 V316 V317 V318 V319 V320
6 1 3 7 1 4 7 1 1 4 2 10 6 3 4 0
V321 V322 V323 V324 V325 V326 V327 V328 V329 V330 V331 V332 V333 V334 V335 V336
0 1 7 0 9 0 12 12 2 0 5 1 2 0 11 0
V337 V338 V339 V340 V341 V342 V343 V344 V345 V346 V347 V348 V349 V350 V351 V352
0 0 0 3 1 1 0 1 10 0 7 4 0 0 0 5
V353 V354 V355 V356 V357 V358 V359 V360 V361 V362 V363 V364 V365 V366 V367 V368
4 11 6 1 0 1 0 0 4 1 2 1 2 7 5 5
V369 V370 V371 V372 V373 V374 V375 V376 V377 V378 V379 V380 V381 V382 V383 V384
12 8 9 9 4 1 0 3 8 12 12 1 3 4 11 0
V385 V386 V387 V388 V389 V390 V391 V392 V393 V394 V395 V396 V397 V398 V399 V400
0 9 0 2 8 2 3 5 1 0 5 7 12 6 5 10
V401 V402 V403 V404 V405 V406 V407 V408 V409 V410 V411 V412 V413 V414 V415 V416
3 12 7 7 1 2 6 0 3 9 5 6 10 0 5 9
V417 V418 V419 V420 V421 V422 V423
12 3 12 6 2 0 7
Inspect the structure of the data. Tip: use view() or structure()
# write your code belowHow many samples with different Diagnosis do we have? Tip: use table()
# write your code belowstructure(metadata_metabolites) SampleID ParticipantID Age Diagnosis Sex
1 CSM5FZ3T C3002 76 CD Female
2 CSM5FZ44 C3002 76 CD Female
3 CSM5FZ48 C3003 43 UC Female
4 CSM5FZ4A C3004 47 UC Female
5 CSM5FZ4K C3003 43 UC Female
6 CSM5FZ4O C3005 76 UC Female
7 CSM5MCTZ C3006 32 UC Male
8 CSM5MCUM C3006 32 UC Male
9 CSM5MCUU C3005 76 UC Female
10 CSM5MCVN C3002 76 CD Female
11 CSM5MCX3 C3006 32 UC Male
12 CSM5MCXF C3012 37 CD Female
13 CSM5MCY8 C3005 76 UC Female
14 CSM5MCZF C3012 37 CD Female
15 CSM67U9T C3015 50 UC Female
16 CSM67U9V C3016 32 CD Female
17 CSM67U9X C3017 45 CD Male
18 CSM67UBH C3002 76 CD Female
19 CSM67UBN C3002 76 CD Female
20 CSM67UBZ C3003 43 UC Female
21 CSM67UDN C3004 47 UC Female
22 CSM67UEA C3005 76 UC Female
23 CSM67UEI C3005 76 UC Female
24 CSM67UFZ C3006 32 UC Male
25 CSM67UGO C3012 37 CD Female
26 CSM67UH7 C3022 69 nonIBD Male
27 CSM79HIT C3016 32 CD Female
28 CSM79HJ6 C3017 45 CD Male
29 CSM79HK1 C3002 76 CD Female
30 CSM79HLM C3003 43 UC Female
31 CSM79HMN C3006 32 UC Male
32 CSM79HOJ C3012 37 CD Female
33 CSM79HQN C3035 62 CD Male
34 CSM79HRG C3031 NA CD Female
35 CSM7KON2 C3012 37 CD Female
36 CSM7KON8 C3012 37 CD Female
37 CSM7KOO9 C3031 NA CD Female
38 CSM7KOOF C3031 NA CD Female
39 CSM7KOP8 C3032 42 UC Female
40 CSM7KOPC C3032 42 UC Female
41 CSM7KOPE C3032 42 UC Female
42 CSM7KOQ7 C3037 46 UC Female
43 CSM7KOQZ C3017 45 CD Male
44 CSM7KORU C3034 36 UC Female
45 CSM7KOTU C3035 62 CD Male
46 CSM7KOU9 C3032 42 UC Female
47 CSM7KOUF C3032 42 UC Female
48 CSM7KOUL C3037 46 UC Female
49 CSM7KOUR C3037 46 UC Female
50 CSM9X1XW C3034 36 UC Female
51 CSM9X1YV C3035 62 CD Male
52 CSM9X1ZC C3031 NA CD Female
53 CSM9X211 C3037 46 UC Female
54 CSM9X213 C3037 46 UC Female
55 CSM9X21J C3034 36 UC Female
56 CSM9X21P C3034 36 UC Female
57 CSM9X22G C3032 42 UC Female
58 CSM9X22Q C3032 42 UC Female
59 CSM9X22U C3037 46 UC Female
60 CSM9X237 C3035 62 CD Male
61 CSM9X23B C3035 62 CD Male
62 CSM9X23L C3037 46 UC Female
63 CSM9X23P C3037 46 UC Female
64 ESM5GEZ6 E5004 7 UC Female
65 ESM5MEBU E5004 7 UC Female
66 ESM5MEC5 E5004 7 UC Female
67 ESM718SY E5004 7 UC Female
68 HSM5FZBJ H4001 14 CD Male
69 HSM5MD3L H4014 10 CD Female
70 HSM5MD41 H4010 13 UC Male
71 HSM5MD43 H4010 13 UC Male
72 HSM5MD4O H4007 15 CD Female
73 HSM5MD4Q H4007 15 CD Female
74 HSM5MD5D H4008 13 nonIBD Female
75 HSM5MD6A H4014 10 CD Female
76 HSM5MD6K H4013 8 nonIBD Male
77 HSM5MD82 H4009 6 nonIBD Female
78 HSM5MD87 H4010 13 UC Male
79 HSM5MD8B H4009 6 nonIBD Female
80 HSM67VFZ H4008 13 nonIBD Female
81 HSM67VG6 H4008 13 nonIBD Female
82 HSM67VGA H4009 6 nonIBD Female
83 HSM67VGG H4009 6 nonIBD Female
84 HSM67VH1 H4010 13 UC Male
85 HSM67VI5 H4024 11 nonIBD Male
86 HSM6XRQE H4019 11 UC Female
87 HSM6XRRD H4019 11 UC Female
88 HSM6XRRV H4014 10 CD Female
89 HSM6XRS8 H4015 15 CD Male
90 HSM6XRSN H4007 15 CD Female
91 HSM6XRSZ H4008 13 nonIBD Female
92 HSM6XRTC H4009 6 nonIBD Female
93 HSM6XRTM H4010 13 UC Male
94 HSM6XRTQ H4010 13 UC Male
95 HSM6XRV2 H4019 11 UC Female
96 HSM6XRVC H4013 8 nonIBD Male
97 HSM6XRVK H4013 8 nonIBD Male
98 HSM6XRVO H4014 10 CD Female
99 HSM6XRVW H4014 10 CD Female
100 HSM7CYYZ H4009 6 nonIBD Female
101 HSM7CYZJ H4022 9 nonIBD Female
102 HSM7CZ1T H4027 15 UC Male
103 HSM7CZ2P H4008 13 nonIBD Female
104 HSM7J4GR H4027 15 UC Male
105 HSM7J4HO H4035 16 UC Male
106 HSM7J4HS H4035 16 UC Male
107 HSM7J4I7 H4024 11 nonIBD Male
108 HSM7J4JT H4040 17 UC Female
109 HSM7J4JZ H4035 16 UC Male
110 HSM7J4K8 H4035 16 UC Male
111 HSM7J4KK H4023 16 nonIBD Male
112 HSM7J4M4 H4019 11 UC Female
113 HSM7J4ME H4019 11 UC Female
114 HSM7J4NE H4027 15 UC Male
115 HSM7J4O1 H4024 11 nonIBD Male
116 HSM7J4O3 H4024 11 nonIBD Male
117 HSM7J4OE H4031 12 CD Male
118 HSM7J4PE H4019 11 UC Female
119 HSM7J4QT H4014 10 CD Female
120 HSM7J4R2 H4013 8 nonIBD Male
121 HSMA33IY H4042 15 UC Male
122 HSMA33J5 H4045 14 nonIBD Female
123 HSMA33JH H4031 12 CD Male
124 HSMA33M8 H4040 17 UC Female
125 HSMA33O1 H4027 15 UC Male
126 HSMA33O5 H4027 15 UC Male
127 HSMA33OR H4035 16 UC Male
128 HSMA33QO H4045 14 nonIBD Female
129 HSMA33SG H4045 14 nonIBD Female
130 HSMA33SK H4045 14 nonIBD Female
131 MSM5LLDA M2008 30 CD Female
132 MSM5LLDS M2008 30 CD Female
133 MSM5LLGH M2014 30 CD Male
134 MSM5LLGR M2028 24 CD Female
135 MSM5LLHQ M2008 30 CD Female
136 MSM5LLHZ M2008 30 CD Female
137 MSM5LLIC M2025 43 CD Female
138 MSM5LLIG M2021 26 CD Male
139 MSM5LLIK M2021 26 CD Male
140 MSM6J2H9 M2039 40 nonIBD Female
141 MSM6J2IQ M2014 30 CD Male
142 MSM6J2J1 M2014 30 CD Male
143 MSM6J2LB M2008 30 CD Female
144 MSM6J2PE M2025 43 CD Female
145 MSM6J2PO M2047 57 nonIBD Male
146 MSM6J2PU M2047 57 nonIBD Male
147 MSM6J2Q3 M2028 24 CD Female
148 MSM6J2QZ M2014 30 CD Male
149 MSM79H7G M2071 26 UC Female
150 MSM79H7M M2071 26 UC Female
151 MSM79H7Q M2047 57 nonIBD Male
152 MSM79H7Y M2047 57 nonIBD Male
153 MSM79H9Q M2061 56 nonIBD Male
154 MSM79HAH M2039 40 nonIBD Female
155 MSM79HAJ M2064 74 UC Male
156 MSM79HAN M2064 74 UC Male
157 MSM79HC8 M2047 57 nonIBD Male
158 MSM79HD6 M2071 26 UC Female
159 MSM79HDA M2069 29 UC Female
160 MSM79HDE M2069 29 UC Female
161 MSM79HDQ M2068 19 CD Female
162 MSM9VZEK M2068 19 CD Female
163 MSM9VZES M2068 19 CD Female
164 MSM9VZEW M2069 29 UC Female
165 MSM9VZFR M2079 29 nonIBD Male
166 MSM9VZGO M2061 56 nonIBD Male
167 MSM9VZHF M2061 56 nonIBD Male
168 MSM9VZHR M2072 51 nonIBD Male
169 MSM9VZIM M2083 25 UC Male
170 MSM9VZIQ M2083 25 UC Male
171 MSM9VZJZ M2075 61 nonIBD Male
172 MSM9VZKE M2079 29 nonIBD Male
173 MSM9VZL9 M2084 23 nonIBD Female
174 MSM9VZLJ M2061 56 nonIBD Male
175 MSM9VZLP M2064 74 UC Male
176 MSM9VZLV M2064 74 UC Male
177 MSM9VZNL M2047 57 nonIBD Male
178 MSM9VZNX M2083 25 UC Male
179 MSM9VZOU M2069 29 UC Female
180 MSM9VZP3 M2069 29 UC Female
181 MSM9VZPL M2085 23 CD Male
182 MSM9VZPV M2060 62 nonIBD Female
183 MSMA267V M2103 35 UC Female
184 MSMA26AL M2068 19 CD Female
185 MSMA26BB M2084 23 nonIBD Female
186 MSMA26BR M2079 29 nonIBD Male
187 MSMA26EJ M2069 29 UC Female
188 MSMAPC5D M2083 25 UC Male
189 MSMAPC5F M2097 21 nonIBD Male
190 MSMAPC62 M2103 35 UC Female
191 MSMAPC6G M2085 23 CD Male
192 MSMAPC6K M2084 23 nonIBD Female
193 MSMAPC7J M2075 61 nonIBD Male
194 MSMAPC7T M2077 32 nonIBD Female
195 MSMB4LXW M2097 21 nonIBD Male
196 MSMB4LYB M2085 23 CD Male
197 MSMB4LYH M2103 35 UC Female
198 MSMB4LZ4 M2071 26 UC Female
199 MSMB4LZ8 M2079 29 nonIBD Male
200 MSMB4LZC M2077 32 nonIBD Female
201 MSMB4LZR M2083 25 UC Male
202 MSMB4LZX M2084 23 nonIBD Female
203 PSM6XBQY P6005 11 CD Male
204 PSM6XBS6 P6009 16 CD Male
205 PSM6XBSI P6010 10 CD Male
206 PSM6XBSS P6005 11 CD Male
207 PSM6XBTF P6009 16 CD Male
208 PSM6XBUG P6010 10 CD Male
209 PSM6XBUO P6010 10 CD Male
210 PSM7J12V P6025 17 UC Male
211 PSM7J134 P6025 17 UC Male
212 PSM7J136 P6005 11 CD Male
213 PSM7J13E P6005 11 CD Male
214 PSM7J141 P6009 16 CD Male
215 PSM7J14L P6010 10 CD Male
216 PSM7J158 P6016 16 CD Male
217 PSM7J15U P6018 17 nonIBD Female
218 PSM7J161 P6028 9 CD Male
219 PSM7J167 P6028 9 CD Male
220 PSM7J177 P6028 9 CD Male
221 PSM7J17F P6028 9 CD Male
222 PSM7J17L P6016 16 CD Male
223 PSM7J188 P6033 15 CD Male
224 PSM7J193 P6013 6 UC Male
225 PSM7J19J P6016 16 CD Male
226 PSM7J1CC P6005 11 CD Male
227 PSM7J1CK P6009 16 CD Male
228 PSM7J1CU P6009 16 CD Male
229 PSMA264Q P6035 16 UC Male
230 PSMA265N P6016 16 CD Male
231 PSMA265T P6016 16 CD Male
232 PSMA265X P6038 16 UC Female
233 PSMA266C P6028 9 CD Male
234 PSMA2675 P6038 16 UC Female
235 PSMA267J P6035 16 UC Male
236 PSMA267P P6035 16 UC Male
237 PSMA269E P6028 9 CD Male
238 PSMA269W P6038 16 UC Female
239 PSMA26A3 P6038 16 UC Female
240 PSMB4MBK P6035 16 UC Male
241 PSMB4MC5 P6038 16 UC Female
The metadata is the same as for the microbiome ! 👀
table(metadata_metabolites$Diagnosis)
CD nonIBD UC
87 55 99
5 Data preatreatment 🔍
Before we can start using some more fancy unsupervised or supervised techniques, we need to understand the importance of data preatreatment, that includes:
transformation, and
We will explore their impact and importance on the data distribution as well as common methods in metabolomics.
5.1 Normalization:
We apply normalization of our data to reduce the effect size of huge variations in the metabolite concentration and make the samples more comparable. Normalization is sample-wise (per sample), whereas scaling or transformation are column-wise (per metabolite).
The structure of the function we have provided is as follows:
pqn(data_matrix)The function accepts a data_matrix and return a pqn normalized matrix, with the same structure (metabolites in columns and samples in rows).
TASK 1
Normalize the data using pqn normalization and assign it to a new variable, e.g. metabolites_pqn.
metabolites_pqn <- pqn(metabolites)For each sample, every metabolite intensity value is divided by the total sum of all metabolite intensity values measured in that sample (NA values ignored by default), before multiplication by 100. The unit is %.
We have provided TAN normalization in a function called normalise_tan(), which is analogical to the pqn function:
normalize_tan(data_matrix)TASK 1
Normalize the metabolites data using TAN normalizatio and assign it to a new variable, e.g. metabolites_tan.
metabolites_tan <- normalize_tan(metabolites)From here, we are going to use pqn normalized metabolites data for consequent analysis.
5.2 Scaling & transformation
Unlike normalization or scaling, data transformation is a nonlinear pretreatment method. It is applied to a normalized data to correct for the data heteroscedasticity and skewness, transforming the distribution closer to the Gaussian one, and stabilizing the variance.
Data scaling aims to correct for between-compounds variation by applying a scaling factor to the variables (which can vary for each metabolite) and give them a comparable influence on data variance.
Moreover, many multivariate methods require scaled or transformed data to work properly. Ther
Let’s inspect how the distribution changes before and after transformation/scaling, based on the first metabolite in our data:
hist(metabolites_pqn[,1],
main = "Original pqn normalized data",
xlab = colnames(metabolites_pqn)[1])Auto scaling is already implemented in the base R, as a scale() function.
TASK 1
Inspect the scale() function. A manual in the ‘Help’ tab of RStudio should open providing explanation and example usage of the function:
# execute the code ----->
?base::scaleTASK 2
Based on the information in the manual answer the following question: what arguments does the scale() function specify and what do they control?
TASK 3
Let’s auto scale the data. For this, we need to specify parameter’s values center=TRUE and scale=TRUE.
autoscaled_metabolites_pqn <- scale(metabolites_pqn,center = TRUE, scale = TRUE)TASK 4
Let’s scale the data and plot the distribution of the first metabolite after auto scaling.
To compute pareto scaling, we can use scaling() function from the MetabolAnalyze package.
TASK 1
Inspect the scaling() function. A manual in the ‘Help’ tab of RStudio should open providing explanation and example usage of the function:
# execute the code ----->
?MetabolAnalyze::scalingTASK 2
Based on the information in the manual answer the following question: How does pareto scaling work?
TASK 3
Let’s scale the data using pareto scaling. We need to specify a value of an argument type.
pareto_metabolites_pqn <- scaling(metabolites_pqn, type = "pareto")TASK 4
Let’s plot the distribution of the first metabolite after pareto scaling.
We have already explored the concept of log transformation yesterday.
To log transform our data, we can simply use log(). By default, log10 is calculated, but we can specify the base it in the base argument of the function:
log(x, base)TASK 1
Let’s perfom log transformation on our data using 10 as a base. PS. Remember to add a pseudo-count to data.
log_metabolites_pqn <- log(metabolites_pqn +1)TASK 2
Let’s plot the distribution of the first metabolite after log transformation.
To square root transform our data, we can simply use sqrt().
TASK 1
Let’s perform square root transformation on the data.
sqrt_metabolites_pqn <- sqrt(metabolites_pqn)TASK 2
Let’s plot the distribution of the first metabolite after log transformation.
6 Principal Component Analysis
We are going to run some PCA analysis on metabolites data and plot the results. It’s a great unsupervised tool to explore the data and identify any potential outliers, clusterings and structure in the data.
Let’s load the useful packages first, factoextra and plotly:
library(factoextra) # PCA functions
library(plotly) # interactive plots6.1 PCA model
To perform PCA, we can use prcomp function, which has the following structure:
prcomp(data,
center = TRUE,
scale = TRUE)Note, that prcomp automatically performs auto scaling, however we can control this with parameters center and scale.
It’s important to auto scale the data before PCA, to give the equal importance to variables in the data, to avoid them “dragging” the decomposition towards some. variables.
TASK 1
Let’s compute our first PCA:
# write your code herepca <- prcomp(metabolites_pqn,
center = TRUE,
scale = TRUE)Inspect the structure of the resulting model. Use str().
str(pca)List of 5
$ sdev : num [1:241] 9.35 5.44 4.7 4.03 3.61 ...
$ rotation: num [1:423, 1:241] -0.0148 -0.0519 -0.0343 0.0706 0.067 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:423] "V1" "V2" "V3" "V4" ...
.. ..$ : chr [1:241] "PC1" "PC2" "PC3" "PC4" ...
$ center : Named num [1:423] 16824772 600613 172033 881174 64772837 ...
..- attr(*, "names")= chr [1:423] "V1" "V2" "V3" "V4" ...
$ scale : Named num [1:423] 22453073 1603145 227900 1123252 87005472 ...
..- attr(*, "names")= chr [1:423] "V1" "V2" "V3" "V4" ...
$ x : num [1:241, 1:241] 7.69 9.9 6.19 1.84 6.07 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:241] "CSM5FZ3T" "CSM5FZ44" "CSM5FZ48" "CSM5FZ4A" ...
.. ..$ : chr [1:241] "PC1" "PC2" "PC3" "PC4" ...
- attr(*, "class")= chr "prcomp"
sdev - the standard deviations of the principal components
rotation - variable loadings
x - scores
6.2 Visualizing the results 🎇
We can visualize the percentage of explained varaince for each compount using plot.
Let’s try to plot the score plot of the first PC against the second and visualize the Diagnosis! We also plot the explained variance of those component, by using get_eigenvalue() function on pca model.
Code
ggplot(pca$x, aes(x = PC1,
y = PC2,
color = metadata_metabolites$Diagnosis)) +
geom_point() +
labs(x = paste0("PC1 (",round(get_eigenvalue(pca)[1,2],2),"%)"),
y = paste0("PC2 (",round(get_eigenvalue(pca)[2,2],2),"%)")) +
theme_bw()Let’s look at the score plot, if we would skip auto scaling.
Code
pca_plain <- prcomp(metabolites_pqn,
center = FALSE,
scale = FALSE)
ggplot(pca_plain$x, aes(x = PC1,
y = PC2,
color = metadata_metabolites$Diagnosis)) +
geom_point() +
labs(x = paste0("PC1 (",round(get_eigenvalue(pca_plain)[1,2],2),"%)"),
y = paste0("PC2 (",round(get_eigenvalue(pca_plain)[2,2],2),"%)")) +
theme_bw()The larger variables pull the decomposition towards them, therefore we are not able to distinguish between variables.
6.3 Loadings 🚚
From PCA model we also obtain our loadings. Let’s visualize loadings plots (bar plots) for the first two PC’s:
Code
# First, we need to reshape our data into a long format using melt() function:
loadings <- melt(as.data.frame(pca$rotation) %>%
mutate(metabolite = rownames(pca$rotation)),
id.vars = "metabolite")
ggplot(loadings %>%
filter(variable %in% c("PC1", "PC2")),
aes(x = metabolite, y = value)) +
geom_bar(stat="identity", alpha = 0.8, position = "stack", width = 0.5) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
facet_wrap(~variable, ncol = 2)We can set a threshold or choose top x metabolites, that have the biggest influence on the decomposition:
Code
ggplot(loadings %>%
filter(variable %in% c("PC1", "PC2")),
aes(x = metabolite, y = value)) +
geom_bar(stat="identity", alpha = 0.8, position = "stack", width = 0.5) +
theme_bw() +
geom_hline(yintercept = c(0.08, -0.08), linetype="dotted", linewidth = 0.3, color = "red") +
theme(axis.text.x = element_text(angle = 90)) +
facet_wrap(~variable, ncol = 2)To select the metabolites, that are over or below selected threshold, e.g. 0.08, we can type:
important_metabolites <- loadings$metabolite[which(abs(loadings$value) > 0.08 & loadings$variable == "PC1")]
important_metabolites [1] "V166" "V238" "V252" "V257" "V267" "V268" "V269" "V270" "V271" "V325"
[11] "V329" "V346" "V384"
6.4 Biplot
We can also visualize the scores and loadings at the same time! This is called biplot.
For that, we can use biplot, but we are going to use fviz_pca_biplot from the factoextra package,which is an ggplot object:
6.5 Outliers
Notice on our scales PCA plot how some of the samples lay further away from the clump of points. Those points are potential outliers.
We can make an interactive plot and inspect which samples are outlying by pointing a cursor over points. For that, we can use ggplotly function from plotly package.
Code
p <- ggplot(as.data.frame(pca$x), aes(x = PC1,
y = PC2,
z = rownames(pca$x),
color = metadata_metabolites$Diagnosis)) +
geom_point() +
theme_bw()
ggplotly(p)Let’s remove the outliers and see how it changes the results.
Outliers can be due to the errors in sampling, but they also result from an important biological property. Before removing any outliers, it’s important to further check why those samples are different from the rest and decide on what to do with them.
To remove outliers, we can either filter by the scores values, by using which() or directly by names. For example:
metabolites_pqn[which(! metadata_metabolites$SampleID %in% c("PSM7J177", "HSM6XRS8")), ]will filter out the samples, that are not any of the specified ones. Note ! at the beggining of the function, which can be translated to “not”.
Let’s remove samples with PC2 > 20and perform pca on that.
Code
pca_reduced <- prcomp(metabolites_pqn[which(pca$x[,2] <= 20),] ,
center = TRUE,
scale = TRUE)
ggplot(pca_reduced$x, aes(x = PC1,
y = PC2,
color = metadata_metabolites$Diagnosis[which(pca$x[,2] <= 20)])) +
geom_point() +
labs(x = paste0("PC1 (",round(get_eigenvalue(pca_reduced)[1,2],2),"%)"),
y = paste0("PC2 (",round(get_eigenvalue(pca_reduced)[2,2],2),"%)")) +
theme_bw()After removal of those outliers, new outliers can pop out. Remember, after any treatment that changes the distances between samples (e.g. log transformation), the identified outliers can differ.
7 Next ⏭️
Good job! 🎉 You have finished this tutorial! Now you should be able to use different pretreatment methods and apply PCA in R.
At the next session after lunch, we will explore more fancy methods for metabolomics data analysis!
TODO: Click here: https://github.com/michaljjskawinski/2024-NUTRIM-metabolomics/blob/main/practical-2/web/practical2-instructions.html
8 Session info
sessioninfo::session_info()─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.4.1 (2024-06-14)
os macOS Sonoma 14.2.1
system aarch64, darwin20
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz Europe/Paris
date 2024-06-30
pandoc 3.1.11 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
abind 1.4-5 2016-07-21 [1] CRAN (R 4.4.0)
backports 1.5.0 2024-05-23 [1] CRAN (R 4.4.0)
bitops 1.0-7 2021-04-24 [1] CRAN (R 4.4.0)
broom 1.0.6 2024-05-17 [1] CRAN (R 4.4.0)
car 3.1-2 2023-03-30 [1] CRAN (R 4.4.0)
carData 3.0-5 2022-01-06 [1] CRAN (R 4.4.0)
caTools 1.18.2 2021-03-28 [1] CRAN (R 4.4.0)
cli 3.6.3 2024-06-21 [1] CRAN (R 4.4.0)
colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.4.0)
crosstalk 1.2.1 2023-11-23 [1] CRAN (R 4.4.0)
data.table 1.15.4 2024-03-30 [1] CRAN (R 4.4.0)
digest 0.6.36 2024-06-23 [1] CRAN (R 4.4.0)
dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.4.0)
ellipse * 0.5.0 2023-07-20 [1] CRAN (R 4.4.0)
evaluate 0.24.0 2024-06-10 [1] CRAN (R 4.4.0)
factoextra * 1.0.7 2020-04-01 [1] CRAN (R 4.4.0)
fansi 1.0.6 2023-12-08 [1] CRAN (R 4.4.0)
farver 2.1.2 2024-05-13 [1] CRAN (R 4.4.0)
fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.0)
forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.4.0)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.4.0)
ggplot2 * 3.5.1 2024-04-23 [1] CRAN (R 4.4.0)
ggpubr 0.6.0 2023-02-10 [1] CRAN (R 4.4.0)
ggrepel 0.9.5 2024-01-10 [1] CRAN (R 4.4.0)
ggsignif 0.6.4 2022-10-13 [1] CRAN (R 4.4.0)
glue 1.7.0 2024-01-09 [1] CRAN (R 4.4.0)
gplots * 3.1.3.1 2024-02-02 [1] CRAN (R 4.4.0)
gtable 0.3.5 2024-04-22 [1] CRAN (R 4.4.0)
gtools * 3.9.5 2023-11-20 [1] CRAN (R 4.4.0)
here * 1.0.1 2020-12-13 [1] CRAN (R 4.4.0)
hms 1.1.3 2023-03-21 [1] CRAN (R 4.4.0)
htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.4.0)
httr 1.4.7 2023-08-15 [1] CRAN (R 4.4.0)
jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.4.0)
KernSmooth 2.23-24 2024-05-17 [1] CRAN (R 4.4.1)
knitr 1.47 2024-05-29 [1] CRAN (R 4.4.0)
labeling 0.4.3 2023-08-29 [1] CRAN (R 4.4.0)
lazyeval 0.2.2 2019-03-15 [1] CRAN (R 4.4.0)
lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.4.0)
lubridate * 1.9.3 2023-09-27 [1] CRAN (R 4.4.0)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.4.0)
mclust * 6.1.1 2024-04-29 [1] CRAN (R 4.4.0)
MetabolAnalyze * 1.3.1 2019-08-31 [1] CRAN (R 4.4.0)
munsell 0.5.1 2024-04-01 [1] CRAN (R 4.4.0)
mvtnorm * 1.2-5 2024-05-21 [1] CRAN (R 4.4.0)
pillar 1.9.0 2023-03-22 [1] CRAN (R 4.4.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.4.0)
plotly * 4.10.4 2024-01-13 [1] CRAN (R 4.4.0)
plyr 1.8.9 2023-10-02 [1] CRAN (R 4.4.0)
purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.4.0)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.4.0)
Rcpp 1.0.12 2024-01-09 [1] CRAN (R 4.4.0)
readr * 2.1.5 2024-01-10 [1] CRAN (R 4.4.0)
reshape2 * 1.4.4 2020-04-09 [1] CRAN (R 4.4.0)
rlang 1.1.4 2024-06-04 [1] CRAN (R 4.4.0)
rmarkdown 2.27 2024-05-17 [1] CRAN (R 4.4.0)
rprojroot 2.0.4 2023-11-05 [1] CRAN (R 4.4.0)
rstatix 0.7.2 2023-02-01 [1] CRAN (R 4.4.0)
rstudioapi 0.16.0 2024-03-24 [1] CRAN (R 4.4.0)
scales 1.3.0 2023-11-28 [1] CRAN (R 4.4.0)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.4.0)
stringi 1.8.4 2024-05-06 [1] CRAN (R 4.4.0)
stringr * 1.5.1 2023-11-14 [1] CRAN (R 4.4.0)
tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.4.0)
tidyr * 1.3.1 2024-01-24 [1] CRAN (R 4.4.0)
tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.4.0)
tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.4.0)
timechange 0.3.0 2024-01-18 [1] CRAN (R 4.4.0)
tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.4.0)
utf8 1.2.4 2023-10-22 [1] CRAN (R 4.4.0)
vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.4.0)
viridisLite 0.4.2 2023-05-02 [1] CRAN (R 4.4.0)
withr 3.0.0 2024-01-16 [1] CRAN (R 4.4.0)
xfun 0.45 2024-06-16 [1] CRAN (R 4.4.0)
yaml 2.3.8 2023-12-11 [1] CRAN (R 4.4.0)
[1] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
──────────────────────────────────────────────────────────────────────────────

